###knit to an .md file by changing the output in the
YAML header to github_document, before submitting a tagged
release on canvas.* ###submit a link to your tagged release on
canvas.
This is the Mini Data-Analysis Deliverable 2 for STATS 545A by April Hwang. It demonstrates my knowledge in making summary tables/graphs, manipulating data, model fitting, and reading/writing data as separate files.
Research question 1 : What, if any, is the relationship between time and allowance of pet ownership in this dataset? Similarly, is there any relationship between time and availability of accessible parking?
Research question 2 : Is there a relationship between the number of units and the number of storeys?
Research question 3 : What are the variables for which information is most readily available (i.e. no missing data) and least readily available (i.e. large amount of missing data)?
Research question 4 : What are the top three most popular building amenities?
I will create a new categorical variable, year_level, which identifies the century in which the building was built as 19th century, 20th century, and 21st century.
#create year_level and move it forward as first column
apt_buildings %>%
mutate(year_level = case_when(year_built < 1901 ~ '19th century',
year_built < 2001 ~ '20th century',
TRUE ~ '21st century')) %>%
relocate(year_level)I will create a graph using the above year_level on the
x-axis and no_of_accessible_parking_spaces on the y-axis. I
will use geom_boxplot() and geom_jitter(). I
will also remove a few extreme outliers that contain potentially
questionable values of having over 150 accessible parking spaces.
#create graph using geom_boxplot() and geom_jitter()
apt_buildings %>%
mutate(year_level = case_when(year_built < 1901 ~ '19th century',
year_built < 2001 ~ '20th century',
TRUE ~ '21st century')) %>%
ggplot(aes(x = year_level, y = no_of_accessible_parking_spaces)) +
geom_boxplot(colour = 'grey') +
geom_jitter(alpha = 0.2, colour = 'steelblue') +
scale_y_log10('no_of_accessible_parking_spaces') +
theme_classic()
I will create a new categorical variable, rise_level, which classifies each building as low-rise, mid-rise or high-rise. For this exercise, low-rise buildings will be defined as buildings with 10 floors or under. Mid-rise buildings are defined as buildings that have between 11 to 15 floors. High-rise buildings are defined as buildings that have 16 floors or above.
#create rise_level and move it forward as first column
apt_buildings %>%
mutate(rise_level = case_when(no_of_storeys < 11 ~ 'low-rise',
no_of_storeys < 16 ~ 'mid-rise',
TRUE ~ 'high-rise')) %>%
relocate(rise_level)I will create a graph using geom_jitter() and
geom_smooth(), incorporating the variable created above,
rise_level. I will remove one outlier that is listed as having
over 4000 units, which may be an error.
#create graph using geom_jitter() and geom_smooth()
apt_buildings %>%
mutate(rise_level = case_when(no_of_storeys < 11 ~ 'low-rise',
no_of_storeys < 16 ~ 'mid-rise',
TRUE ~ 'high-rise')) %>%
ggplot(aes(x = no_of_units, y = no_of_storeys, colour = rise_level)) +
geom_jitter(outlier.shape = NA, alpha = 0.2) +
coord_cartesian(xlim = c(0, 700)) +
geom_smooth() +
theme_classic()
For this task, I will compute the number of NAs for all variables in
apt_buildings and generate a new tibble with the new
observations. I will also count the total number of levels as a
reference.
#count NA and tidy data
apt_buildings_NA <- apt_buildings %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = 'col_name', values_to = 'NA_count')
print(apt_buildings_NA)## # A tibble: 37 × 2
## col_name NA_count
## <chr> <int>
## 1 id 0
## 2 air_conditioning 85
## 3 amenities 2518
## 4 balconies 88
## 5 barrier_free_accessibilty_entr 82
## 6 bike_parking 0
## 7 exterior_fire_escape 95
## 8 fire_alarm 87
## 9 garbage_chutes 83
## 10 heating_type 86
## # ℹ 27 more rows
## [1] 37
I will create a bar graph using geom_col which shows
five columns with the highest NA counts. The y-axis will be logarithmic.
It will be arranged in a descending order and will include clean and
non-overlapping x- and y-axis labels as well as a title for the
graph.
#count NA and tidy data
apt_buildings_NA <- apt_buildings %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = 'col_name', values_to = 'NA_count') %>%
arrange(desc(NA_count))
#select 5 columns with highest NA counts
apt_buildings_NA %>%
slice_max(NA_count, n = 5) %>%
#shorten column names
mutate(col_name = fct_recode(col_name,
'prop_management' = 'prop_management_company_name',
'accessible_parking' = 'no_of_accessible_parking_spaces',
'barrier_free_accessible_units' = 'no_barrier_free_accessible_units')) %>%
#plot graph
ggplot(aes(x = fct_rev(fct_reorder(col_name, NA_count)), y = NA_count)) +
geom_col(fill = 'steelblue') +
scale_x_discrete(guide = guide_axis(n.dodge=2)) +
ggtitle('Columns with Highest NA Count') +
xlab('Column') +
ylab('NA count') +
theme_minimal()
I will convert the amenities variable to a factor, then compute the counts of each level. Then I will tidy the data by converting each cell into a single value, discarding irrelevant columns and omitting NAs. For this research question, popularity of the amenities will be quantified by its frequency.
#convert amenities to factor
amenities_fct <- as.factor(apt_buildings$amenities)
#compute counts of each level
amenities_count <- fct_count(amenities_fct)
#convert each cell into single value and tidy data
amenities_count %>%
separate(f, into = c('f1','f2','f3', 'f4','f5','f6'), sep = ' , ') %>%
pivot_longer(c('f1','f2','f3', 'f4','f5','f6'), names_to = 'f', values_to = 'amenities_type') %>%
select(amenities_type) %>%
na.omit() %>%
#count each amenity type
count(amenities_type)I will create a bar graph using geom_bar which shows the
frequency of each type of amenities using the new tibble from the above
summary task. The y-axis will be logarithmic. It will be arranged in an
ascending order and will include clean and non-overlapping x- and y-axis
labels as well as a title for the graph.
#convert amenities to factor
amenities_fct <- as.factor(apt_buildings$amenities)
#compute counts of each level
amenities_count <- fct_count(amenities_fct)
#convert each cell into single value and tidy data
amenities_count %>%
separate(f, into = c('f1','f2','f3', 'f4','f5','f6'), sep = ' , ') %>%
pivot_longer(c('f1','f2','f3', 'f4','f5','f6'), names_to = 'f', values_to = 'amenities_type') %>%
select(amenities_type) %>%
na.omit() %>%
#plot graph
ggplot(aes(x = fct_rev(fct_infreq(amenities_type)))) +
geom_bar(fill = 'steelblue', width = 0.9) +
scale_x_discrete(guide = guide_axis(n.dodge=2)) +
ggtitle('Amenity Type Frequency') +
xlab('Type') +
ylab('Frequency') +
theme_minimal()
Research question 1 : What, if any, is the relationship between time and allowance of pet ownership in this dataset? Similarly, is there any relationship between time and availability of accessible parking?
I now have more information to answer the second half of this research question regarding the relationship between time and availability of accessible parking. It appears that there is a disproportionately large quantity of data for buildings that were built in the 20th century, thus this question can be further refined to examine data from within the 20th century. The relationship between time and allowance of pet ownership still remains unclear and it would be better if this research question was separated into two distinct questions.
Research question 2 : Is there a relationship between the number of units and the number of storeys?
The above operations demonstrate that there is a positive correlation between the number of units and the number of storeys. This is interesting as it aligns with my hypothesis and suggests that buildings with larger units are being built higher rather than wider.
Research question 3 : What are the variables for which information is most readily available (i.e. no missing data) and least readily available (i.e. large amount of missing data)?
The above operations showed that some variables, such as id and bike_parking, do not contain any missing data, whereas other variables, such as amenities and prop_management_company_name, contain a high number of missing data. The results of this question is of interest as the variation in NA count for the different variables is noticeably large. Furthermore, this research question is of a personal interest as filtering and navigating through missing data is an important aspect of my personal research.
Research question 4 : What are the top three most popular building amenities?
According to this dataset, the top three most popular building amenities are outdoor recreation facilities, indoor recreation room and indoor exercise room.
apt_buildings is untidy for the following reasons:
apt_buildings does contain some aspects of a tidy dataset in that each variable is a column and each column is a variable, and each observation is a row and each row is an observation.
In section 1.2, I demonstrated tidying the data for amenities and NA_count. For this task, I will tidy facilities_available. One of the values for this variable contains multiple measurements (‘Green Bin / Organics’) and it also reports missing data with two separate labels (‘Unknown’ and ‘Not Available’).
Before (untidy):
#convert chr to fct
apt_buildings_fct <- apt_buildings %>% mutate(across(where(is_character),as_factor))
#identify levels which are untidy
levels(apt_buildings_fct$facilities_available) ## [1] "Recycling bins" "Green Bin / Organics" "Unknown"
## [4] "Not Available"
After (tidy then untidy):
#convert chr to fct
apt_buildings_fct <- apt_buildings %>% mutate(across(where(is_character),as_factor))
#tidy data (see note)
apt_buildings_tidy <- apt_buildings_fct %>%
mutate(facilities_available = fct_recode(facilities_available,
'Compost' = 'Green Bin / Organics',
'NA' = 'Unknown',
'NA.' = 'Not Available'))
#here^ I am marking the second NA with a period (.) as this specific task requires for me to immediate untidy the data after tidying it. In most cases, I would use the same NA for both 'Unknown' and 'Not Available' as they both indicate missing data
#check that data is tidy
print(select(apt_buildings_tidy,facilities_available))## # A tibble: 3,455 × 1
## facilities_available
## <fct>
## 1 Recycling bins
## 2 Compost
## 3 Compost
## 4 Compost
## 5 Compost
## 6 Compost
## 7 NA
## 8 Compost
## 9 Recycling bins
## 10 Recycling bins
## # ℹ 3,445 more rows
#untidy data
apt_buildings_untidy <- apt_buildings_tidy %>%
mutate(facilities_available = fct_recode(facilities_available,
'Green Bin / Organics' = 'Compost',
'Unknown' = 'NA',
'Not available' = 'NA.'))
#check that data is untidy
print(select(apt_buildings_untidy,facilities_available))## # A tibble: 3,455 × 1
## facilities_available
## <fct>
## 1 Recycling bins
## 2 Green Bin / Organics
## 3 Green Bin / Organics
## 4 Green Bin / Organics
## 5 Green Bin / Organics
## 6 Green Bin / Organics
## 7 Unknown
## 8 Green Bin / Organics
## 9 Recycling bins
## 10 Recycling bins
## # ℹ 3,445 more rows
Is there a relationship between the number of units and the number of storeys?
What are the variables for which information is most readily available (i.e. no missing data) and least readily available (i.e. large amount of missing data)?
I chose the above two questions as the first choice demonstrated promising results of which confirms my hypothesis of positive correlation between the number of units and the number of storeys. It will be interesting to continue with the analysis to explore the robustness of this correlation. The second choice is of a personal interest as managing missing data is an important part of my personal research and I would like to gain more experience in working with missing data.
Version of data for exploring relationship between units and storeys
#prep data using 4-8 functions
apt_buildings_unit_storeys <- apt_buildings %>%
#use select() to drop irrelevant columns
select(no_of_storeys, no_of_units) %>%
#use arrange() to reorder rows
relocate(no_of_units) %>%
#use desc() to sort rows in descending order
arrange(desc(no_of_units)) %>%
#use rename() to shorten column labels
rename(storeys = no_of_storeys,
units = no_of_units) %>%
#use slice() to remove first outlier row with 4000+ units
slice(-1)
print(apt_buildings_unit_storeys)## # A tibble: 3,454 × 2
## units storeys
## <dbl> <dbl>
## 1 719 29
## 2 638 28
## 3 587 29
## 4 584 29
## 5 575 22
## 6 575 30
## 7 571 32
## 8 568 22
## 9 565 32
## 10 561 25
## # ℹ 3,444 more rows
Version of data for counting NA
#prep data using 4-8 functions
apt_buildings_NA <- apt_buildings %>%
#use summarise(), across(), everything() and is.na() to count NA
summarise(across(everything(), ~ sum(is.na(.)))) %>%
#use pivot_longer() to tidy data
pivot_longer(cols = everything(), names_to = 'col_name', values_to = 'NA_count') %>%
#use desc() to sort rows in descending order
arrange(desc(NA_count))
print(apt_buildings_NA)## # A tibble: 37 × 2
## col_name NA_count
## <chr> <int>
## 1 amenities 2518
## 2 prop_management_company_name 1363
## 3 parking_type 350
## 4 no_barrier_free_accessible_units 154
## 5 no_of_accessible_parking_spaces 123
## 6 exterior_fire_escape 95
## 7 non-smoking_building 94
## 8 intercom 90
## 9 pets_allowed 90
## 10 year_registered 89
## # ℹ 27 more rows
Research Question: Is there a relationship between the number of units and the number of storeys?
Variable of interest: Number of units
I will fit a model that makes predictions on the number of units
using the number of storeys by using the lm() function.
#linear regression modeling
apt_buildings_model_fit <- apt_buildings_unit_storeys %>%
ggplot(aes(x = storeys, y = units)) +
geom_point(colour = 'steelblue') +
geom_smooth(method = 'lm', colour = 'black') +
theme_minimal()
print(apt_buildings_model_fit)
output folder using here().#write summary table as csv file using here()
write_csv(apt_buildings_NA, here('output', 'apt_buildings_NA.csv'))#use saveRDS() to write model object to RDS file
saveRDS(apt_buildings_lm, file = here('output', 'apt_buildings_lm.rds'))
#use readRDS() to load RDS file
readRDS(here('output', 'apt_buildings_lm.rds'))##
## Call:
## lm(formula = no_of_units ~ no_of_storeys, data = apt_buildings)
##
## Coefficients:
## (Intercept) no_of_storeys
## -14.00 13.58